Constructing Dictionaries for Named Entity Recognition on Specific Domains from the Web

Authors

  • Keiji Shinzato
  • Satoshi Sekine
  • Naoki Yoshinaga
  • Kentaro Torisawa
Abstract

This paper describes an automatic dictionary construction method for Named Entity Recognition (NER) in specific domains such as restaurant guides. NER is the first step toward Information Extraction (IE), and we believe that such a method is crucial for developing IE systems for a wide range of domains on the World Wide Web (WWW). A serious problem for domain-specific NER is that its performance depends heavily on the size of the training corpus, which requires much human labor to develop. Instead of preparing a large annotated corpus, we attempt to improve NER performance by using dictionaries automatically constructed from HTML documents. Our dictionary construction method exploits the cooccurrence strength of two expressions in HTML itemizations, calculated from average mutual information. Experimental results show that the constructed dictionaries improved NER performance in a restaurant guide domain: our method increased the F1-measure by 2.3 without any additional manual labor.
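As a rough illustration of the cooccurrence measure mentioned in the abstract, the sketch below computes average mutual information between two expressions over a collection of itemizations. It is only a sketch under stated assumptions, not the authors' implementation: each HTML itemization is assumed to be already extracted into a set of strings, the two binary events are "the expression appears in the itemization" and "it does not", and the function name `average_mutual_information` is hypothetical.

```python
import math
from typing import Dict, Iterable, Set, Tuple


def average_mutual_information(a: str, b: str,
                               itemizations: Iterable[Set[str]]) -> float:
    """Average mutual information (in bits) between the binary events
    'a occurs in an itemization' and 'b occurs in an itemization'.

    Assumption: each itemization is a pre-extracted set of expressions.
    """
    items = list(itemizations)
    n = len(items)
    if n == 0:
        return 0.0

    # Joint counts indexed by (a present?, b present?).
    joint: Dict[Tuple[bool, bool], int] = {
        (x, y): 0 for x in (True, False) for y in (True, False)
    }
    for expressions in items:
        joint[(a in expressions, b in expressions)] += 1

    # Marginal probabilities of each event.
    p_a = {x: (joint[(x, True)] + joint[(x, False)]) / n for x in (True, False)}
    p_b = {y: (joint[(True, y)] + joint[(False, y)]) / n for y in (True, False)}

    ami = 0.0
    for (x, y), count in joint.items():
        if count == 0:
            continue  # a zero-probability cell contributes nothing
        p_xy = count / n
        ami += p_xy * math.log2(p_xy / (p_a[x] * p_b[y]))
    return ami


# Hypothetical itemizations, for illustration only.
example = [{"sushi", "tempura"}, {"sushi", "tempura"}, {"ramen"}, {"udon"}]
score = average_mutual_information("sushi", "tempura", example)
```

In this toy input the two expressions always appear together or are both absent, so the score is high; expressions that occur independently of each other score near zero. How the paper thresholds or combines such scores is not stated in the abstract.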


Similar resources

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three approaches have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well on the Persian language if they are trained with good features. To get good performanc...


PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...


Named Entity Recognition in Persian Text using Deep Learning

Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subtask of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...


Domain Specific Named Entity Recognition (DSNER) from Web Documents

Named entity recognition is a tool used in natural language processing tasks such as text categorization, speech translation, and document classification. Web data promotes the idea that more and more data can be interconnected. A step towards this goal is to bring more structured annotations to existing documents using common vocabularies or ontologies. Semi-structured texts such as scient...


Named-Entity Recognition in Novel Domains with External Lexical Knowledge

We investigate the adaptation of structured classifiers to new domains, in particular the problem of using a supervised Named-Entity Recognition (NER) system on data from a different source than the training data. We present a Semi-Markov Model, trained with the perceptron algorithm, coupled with an external dictionary with the goal of improving generalization on the novel domain. Preliminary ...




Publication date: 2006